SETUP

FXN: Save graphical outputs

picsave <- function(graph, name) {
  ggsave(plot = graph, filename = name, device = "pdf", width = 12, height = 8,
         path = "~/GitHub/S_Lipkind_Rundergrad2020/week3/pics/")
}

NOTES

CHAPTER 4

N/A

CHAPTER 5

PROBLEMS

CHAPTER 4 PROBLEMS

4.4

  1. Why does this code not work?
my_variable <- 10
my_varıable
## [1] 10
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found

The second line spells the name with a dotless ı (the Turkish letter) instead of i, so R looks for an object that was never created. Not sure how else someone could end up typing ı instead of i.
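
One way to confirm that the two letters really are different characters is to compare their code points. This is a small base R sketch; the names `dotted` and `dotless` are made up for illustration.

```r
# Compare the ASCII "i" with the dotless "ı". The dotless letter is written
# as a Unicode escape so the example does not depend on the file encoding.
dotted  <- "i"
dotless <- "\u0131"

identical(dotted, dotless)  # FALSE: two different characters
utf8ToInt(dotted)           # 105
utf8ToInt(dotless)          # 305
```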

  2. Tweak each of the following R commands so that they run correctly:
a <- ggplot(data = mpg) + #dota -> data
  geom_point(mapping = aes(x = displ, y = hwy))
picsave(a, "4.4.2 graph.pdf")

filter(mpg, cyl == 8) #= -> ==
filter(diamonds, carat > 3) # diamond -> diamonds

  3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

Oh my goodness, that is amazing. An entire list of keyboard shortcuts. You could also reach that page by going to Help > Keyboard Shortcuts Help.

CHAPTER 5 PROBLEMS

5.2.4

Find all flights that…

#### 1.1: had an arrival delay of two or more hours

head(flights)
filter(flights, arr_delay >= 120) # arr_delay is in minutes, so two hours = 120

#### 1.4: Departed in summer (July, August, and September)

filter(flights, month %in% c(7,8,9))

#### 1.5: Arrived more than two hours late, but didn’t leave late

filter(flights, arr_delay > 120, dep_delay <= 0) # delays are in minutes; "didn't leave late" includes early departures

#### 1.7: Departed between midnight and 6am (inclusive)

(d <- filter(flights, dep_time <= 600 | dep_time == 2400)) # midnight is recorded as 2400, not 0

#### 2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

?between

It’s an inclusive shortcut to find values within a certain range: between(x, left, right) is equivalent to x >= left & x <= right.

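e <- filter(flights, between(dep_time, 0, 600)) # between() already returns TRUE/FALSE, so no %in% is needed
# midnight departures are recorded as 2400, so add | dep_time == 2400 to include them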
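
To see what between() returns on its own, here is a minimal sketch on a made-up vector `x` (not taken from flights). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical values around the 0-600 range, plus a missing value
x <- c(0, 300, 600, 601, NA)

# between() is inclusive on both ends and propagates NA
inside <- between(x, 0, 600)
inside

# The same test written out by hand
by_hand <- x >= 0 & x <= 600
identical(inside, by_hand)
```

Because the result is already a logical vector, it can be passed straight to filter() as a condition.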

#### 3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

filter(flights, is.na(dep_time)) # 8255 rows/flights
# these observations also all have NA dep_delay, arr_time, arr_delay, and air_time.
# Hypothesis: cancelled flights
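
Rather than checking by eye which other variables are missing, the NAs can be counted per column. The sketch below uses a made-up data frame `toy` with a few flights-style column names. It is not real flight data.

```r
# Hypothetical toy data: the second flight has no departure record
toy <- data.frame(
  dep_time  = c(517, NA, 542),
  dep_delay = c(2, NA, -3),
  arr_time  = c(830, NA, 850),
  carrier   = c("UA", "AA", "B6")
)

# Keep the rows with a missing departure time, then count NAs in each column
no_dep    <- toy[is.na(toy$dep_time), ]
na_counts <- colSums(is.na(no_dep))
na_counts
```

A column whose count equals nrow(no_dep) is missing for every such flight. The same two lines should work with flights in place of toy.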

5.3.1

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

arrange(flights, desc(is.na(dep_delay))) #this works, but is it the intended solution?
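
Why this works can be seen on a tiny made-up tibble (`toy` and its values are hypothetical): is.na() returns FALSE/TRUE, FALSE sorts before TRUE, and desc() flips that so the missing rows come first.

```r
library(dplyr)

# Hypothetical data: rows 2 and 4 have a missing dep_delay
toy <- tibble(id = 1:4, dep_delay = c(5, NA, -2, NA))

# TRUE (missing) rows are moved to the top; tied rows keep their original order
sorted <- arrange(toy, desc(is.na(dep_delay)))
sorted$id
```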

Sort flights to find the most delayed flights. Find the flights that left earliest.

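arrange(flights, desc(dep_delay)) # most delayed flights first
arrange(flights, dep_delay) # left earliest relative to schedule (most negative delay first)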

Sort flights to find the fastest (highest speed) flights.

y <- arrange(flights, desc(distance / air_time)) # speed = distance / time, highest first
#(select(y, distance, air_time)) -> double-checking

Which flights travelled the farthest? Which travelled the shortest?

(longest_distance <- top_n(flights, 10, distance))
(shortest_distance <- top_n(flights, -10, distance))

5.4.1

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

#one option: select()
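
A few possibilities, sketched on a one-row stand-in tibble. `toy` is made up and only its column names mirror flights. The result names (`by_name` and so on) are also just for illustration.

```r
library(dplyr)

# Hypothetical stand-in with some of the flights column names
toy <- tibble(year = 2013, dep_time = 517, sched_dep_time = 515, dep_delay = 2,
              arr_time = 830, sched_arr_time = 819, arr_delay = 11)

wanted <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

by_name   <- select(toy, dep_time, dep_delay, arr_time, arr_delay)  # bare names
by_vector <- select(toy, all_of(wanted))                            # character vector
by_prefix <- select(toy, starts_with("dep_"), starts_with("arr_"))  # name prefixes
by_regex  <- select(toy, matches("^(dep|arr)_(time|delay)$"))       # regular expression

names(by_regex)
```

The prefix and regex versions leave out sched_dep_time and sched_arr_time because those names start with "sched_".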

What happens if you include the name of a variable multiple times in a select() call?

select(flights, day, month, day, month, dep_delay, dep_delay)

Repeating variable names makes no difference: each column appears once, in the position of its first mention.

What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay") 
select(flights, one_of(vars))

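one_of() selects the columns whose names appear in a character vector. That helps here because the names can be stored once in a variable such as vars and reused, instead of being typed out in every select() call.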
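
Newer versions of dplyr/tidyselect also provide any_of() and all_of() for the same job. The sketch below assumes such a version is installed and uses a made-up tibble `toy` that lacks arr_delay, to show how the two differ when a name is missing.

```r
library(dplyr)

# Hypothetical data without an arr_delay column
toy  <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2)
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() selects the names that exist and silently skips the rest
lenient <- select(toy, any_of(vars))
names(lenient)

# all_of() is strict: a missing name is an error
strict <- try(select(toy, all_of(vars)), silent = TRUE)
inherits(strict, "try-error")
```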

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

The results aren’t too surprising, though I didn’t realize the select helpers ignore case by default. To make the match case-sensitive, set ignore.case = FALSE:

select(flights, contains("time", ignore.case = FALSE))

5.5.2

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

head(flights)
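
One possible approach, as a sketch: split the HHMM number into hours and minutes with integer division and remainder. The helper name `hhmm_to_minutes` is made up. The final `%% 1440` is there on the assumption that midnight is stored as 2400 and should map to 0.

```r
# Hypothetical helper: convert an HHMM clock time to minutes since midnight.
# %/% is integer division and %% is the remainder.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440  # final %% maps 2400 to 0
}

minutes <- hhmm_to_minutes(c(517, 1200, 2359, 2400))
minutes
# [1]  317  720 1439    0
```

With dplyr loaded, this could be applied as mutate(flights, dep_time_min = hhmm_to_minutes(dep_time), sched_dep_time_min = hhmm_to_minutes(sched_dep_time)).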

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
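
A first observation, sketched with made-up clock times: arr_time and dep_time are HHMM numbers, so subtracting them directly does not give a duration in minutes. The helper `to_minutes` is hypothetical.

```r
# Hypothetical helper: HHMM clock time -> minutes since midnight
to_minutes <- function(hhmm) (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440

dep_time <- 517  # 5:17
arr_time <- 830  # 8:30

raw_diff   <- arr_time - dep_time                          # 313: not a duration
clock_diff <- to_minutes(arr_time) - to_minutes(dep_time)  # 193 elapsed clock minutes
c(raw_diff, clock_diff)
```

Even after this conversion, the difference would not be expected to match air_time exactly: the clock times are local (so time zones matter), overnight flights wrap past midnight, and time between departure and arrival includes taxiing that air_time does not.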

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
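
A small sketch of how min_rank() treats ties, using a made-up vector `delays` rather than the real data:

```r
library(dplyr)

# Hypothetical delays in minutes; two flights tie for the largest delay
delays <- c(10, 30, 30, 20)

# Tied values share the smallest rank, and the next rank is skipped
ranks <- min_rank(desc(delays))
ranks
# [1] 4 1 1 3
```

So a filter such as filter(flights, min_rank(desc(dep_delay)) <= 10) would keep every flight tied for 10th place, which may return more than 10 rows.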

What does 1:3 + 1:10 return? Why?
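
This is R's vector recycling at work, as the sketch below shows. suppressWarnings() is used only to keep the output clean; without it R warns that the longer length is not a multiple of the shorter one.

```r
# The shorter vector is recycled: 1:3 is reused as 1 2 3 1 2 3 1 2 3 1,
# then added element by element to 1:10.
result <- suppressWarnings(1:3 + 1:10)
result
# [1]  2  4  6  5  7  9  8 10 12 11
```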

What trigonometric functions does R provide?
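
Base R has sin(), cos(), tan(), the inverses asin(), acos(), atan(), the two-argument atan2(), and sinpi(), cospi(), tanpi() for arguments given in multiples of pi (see ?Trig). A few quick checks:

```r
# All of these work in radians
sin(pi / 2)   # 1
cos(pi)       # -1
acos(-1)      # pi
atan2(1, 1)   # pi / 4: angle of the point (x = 1, y = 1)

# The *pi variants take multiples of pi and avoid floating point rounding
sinpi(1)      # exactly 0
sin(pi)       # tiny but not exactly 0
```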